One of the major drawbacks of the standard pattern-recognition
approach to isolated word recognition is that poor performance is
generally achieved for word vocabularies with acoustically similar
words. This poor performance is related to the pattern similarity
(distance) algorithms that are generally used, in which a global
distance between the test pattern and each reference pattern is
computed. Since acoustically similar words are, by definition,
globally similar, it is difficult to reliably discriminate such words,
and a high error rate is obtained. By modifying the pattern-similarity
algorithm so that the recognition decision is made in two passes, we
can achieve improvements in discriminability among similar words.
In particular, on the first pass the recognizer provides a set of global
distance scores which are used to decide a class (or a set of possible
classes) to which the spoken word is estimated to belong. On the
second pass we use a locally weighted distance to provide optimal
separation among words in the chosen class (or classes), and make
the recognition decision on the basis of these local distance scores.
For a highly complex vocabulary (letters of the alphabet, digits, and
three command words), we obtain recognition improvements of 3 to
7 percent using the two-pass recognition strategy.

I. INTRODUCTION 

As illustrated in Fig. 1, the "standard" pattern recognition approach 
to isolated word recognition is a three-step method consisting of 
feature measurement, pattern similarity determination, and a decision 
rule for choosing recognition candidates. This pattern recognition 
model has been applied to a wide variety of word recognition systems 
with great success. 1-8 However, the simple, straightforward approach 
to word recognition, shown in Fig. 1, runs into difficulties for complex 
vocabularies, i.e., vocabularies with phonetically similar words. For
example, recognition of the vocabulary consisting of the letters of the
alphabet would have problems with letters in the sets

$\phi_1 = \{A, J, K\}$,
$\phi_2 = \{B, C, D, E, G, P, V, T, Z\}$,
$\phi_3 = \{Q, U\}$,
$\phi_4 = \{I, Y\}$,
$\phi_5 = \{L, M, N\}$,
$\phi_6 = \{F, S, X\}$.

Fig. 1 - Block diagram of standard approach to isolated word recognition.

Similarly, recognition of the computer terms of Gold 9 might lead to 
confusions among the set containing four, store, and core. In the above 
cases the problems are due to the inherent acoustic similarity (overlap) 
between sets of words in the vocabulary. It should be clear that this 
type of problem is essentially unrelated to vocabulary size (except 
when we approach very large vocabularies), since a large vocabulary 
may contain no similar words (e.g., the Japanese cities list of Itakura 4 ), 
and a small vocabulary may contain many similar words (e.g., the 
letters of the alphabet). 

The purpose of this paper is to propose, discuss, and evaluate a
modified approach to isolated word recognition in which a two-pass
method is used. The output of the first recognition pass is an ordered
set of word classes in which the unknown spoken word is estimated to
have occurred, and the output of the second pass is an ordered list of
word candidates within each class obtained from the first pass. The
computation for the first pass is similar in nature but often reduced in
magnitude from that required for the standard one-pass word
recognizer. The computation of the second pass consists of using an
"optimally" determined word discriminator to separate words within the
equivalence class. In Section II, we present the two-pass recognizer,
and discuss its philosophy and method of implementation. In Section
III, we give an evaluation of the effectiveness of the two-pass approach
for a vocabulary consisting of the 26 letters of the alphabet, the 10
digits, and the command words stop, error, and repeat. Finally, in
Section IV, we summarize the results and show how they are applicable
to practical speech recognition systems.

II. THE TWO-PASS RECOGNIZER

Assume the word vocabulary consists of V words. The ith word, $v_i$,
is represented by the word template $R_i$, $i = 1, 2, \ldots, V$, where
each $R_i$ is a multidimensional feature vector. Similarly, we denote the
test pattern as T (corresponding to the spoken word q in the vocabulary),
where T is again a multidimensional feature vector. For simplicity we
assume that the pattern similarity and distance computation is carried
out using the "normalize and warp" procedure described by Myers et
al., 10 and illustrated in Fig. 2. A "standard" word duration of N frames
is adopted, and each reference pattern is linearly warped to this
duration. We call the warped reference patterns $\tilde{R}_i$. Similarly,
the test pattern is linearly warped to a duration of N frames, yielding
the new pattern $\tilde{T}$. A dynamic time-warping alignment algorithm
then computes the "standard" distance

$$D(\tilde{T}, \tilde{R}_i) = \frac{1}{N} \sum_{k=1}^{N} d(\tilde{T}(k), \tilde{R}_i(w(k))), \tag{1}$$

where $d(\tilde{T}(k), \tilde{R}_i(l))$ is the local distance between frame
k of the test pattern and frame l of the ith reference pattern, and w(k)
is the time-alignment mapping between frame k of the test pattern and
frame w(k) of the ith reference pattern. For a given test pattern, the
total distance D of eq. (1) is a function of i only.

We define the local distance of the kth frame of the test pattern to
the w(k)th frame of the ith reference pattern as $d_i(k)$, where

$$d_i(k) = d(\tilde{T}(k), \tilde{R}_i(w(k))), \tag{2}$$

so $D(\tilde{T}, \tilde{R}_i)$ of eq. (1) can be written as

$$D(\tilde{T}, \tilde{R}_i) = \frac{1}{N} \sum_{k=1}^{N} d_i(k). \tag{3}$$
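The normalize-and-warp distance of eqs. (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, a Euclidean local distance stands in for the LPC log likelihood measure used in the paper, and the path constraints (each test frame maps to one reference frame, with the reference index advancing by 0, 1, or 2 per test frame) are a simplification rather than the constraints of Myers et al.

```python
import math

def linear_warp(frames, n):
    """Resample a list of feature vectors to exactly n frames (nearest-frame linear warp)."""
    m = len(frames)
    if n == 1:
        return [frames[0]]
    return [frames[round(k * (m - 1) / (n - 1))] for k in range(n)]

def euclidean(a, b):
    # Stand-in local distance (assumption); the paper uses an LPC log likelihood measure.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_frame_distances(test, ref, local=euclidean):
    """Return (D, d): d[k] is the local distance of test frame k to reference
    frame w(k) along the best path, and D = sum(d) / N as in eq. (3)."""
    n = len(test)
    assert len(ref) == n, "both patterns must already be warped to N frames"
    inf = float("inf")
    cost = [[inf] * n for _ in range(n)]   # cost[k][l]: best cost ending at (k, l)
    back = [[-1] * n for _ in range(n)]    # back[k][l]: reference index used at k - 1
    cost[0][0] = local(test[0], ref[0])    # endpoint constraint w(0) = 0
    for k in range(1, n):
        for l in range(n):
            best, arg = inf, -1
            for step in (0, 1, 2):         # reference index may advance by 0, 1, or 2
                p = l - step
                if p >= 0 and cost[k - 1][p] < best:
                    best, arg = cost[k - 1][p], p
            if arg >= 0:
                cost[k][l] = best + local(test[k], ref[l])
                back[k][l] = arg
    # Backtrack from the forced end point w(N - 1) = N - 1.
    w = [0] * n
    l = n - 1
    for k in range(n - 1, -1, -1):
        w[k] = l
        l = back[k][l]
    d = [local(test[k], ref[w[k]]) for k in range(n)]
    return sum(d) / n, d
```

The returned list plays the role of $d_i(k)$ in eq. (2); keeping it, rather than only the total, is what makes the frame weighting of the second pass possible.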
Fig. 2 - Block diagram of the normalize-and-warp procedure for equalizing the
lengths of words.



ISOLATED WORD RECOGNITION 741 



If $\tilde{R}_i$ corresponds to the correct reference for the spoken word
T (i.e., i = q), then we would theoretically expect the local distance
$d_q(k)$ to be independent of k, with $d_q(k)$ assuming values from a
$\chi^2$ distribution with p (eight for the system we are using) degrees
of freedom for the case where the speech features are those of an LPC
model and the log likelihood distance measure is used for the local
distance. 11,12 Thus, if we plotted $d_q(k)$ versus k, we would expect it
to vary around some expected value $\bar{d}$, where

$$\bar{d} = E[d_q(k)] = E[\chi_p^2]. \tag{4}$$

An example of a typical curve of $d_q(k)$ versus k is given in Fig. 3a.

If we now examine the typical behavior of the curve of $d_i(k)$ versus
k when $i \neq q$, we see that one of two types of behavior generally
occurs. When word q is acoustically very different from word i, then
$d_i(k)$ is generally large [compared to $\bar{d}$ of eq. (4)] for all
values of k, and the overall distance score D of eq. (3) is large. This
case is illustrated in Fig. 3b. However, when we have acoustically similar
words, then generally $d_i(k)$ will be approximately equal to $d_q(k)$ for
all values of k in acoustically identical regions, and will be larger than
$d_q(k)$ only in acoustically dissimilar regions. An example in which the
dissimilar region occurs at the beginning of the word (the first $\hat{N}$
frames) is shown in Fig. 3c.

Fig. 3 - Curves of $d_i(k)$ versus k for three cases.

The key point to be noted from the above discussion is that when
the vocabulary contains words that are acoustically similar, and one of
these similar words is spoken (i.e., it is the test utterance), then the
total distance scores for these similar words consist of a random
component [because of the variations of d(k) in the similar regions]
and a deterministic difference (because of the differences in the
dissimilar regions). In cases when the size of the dissimilar region is
small (i.e., $\hat{N} \ll N$ in Fig. 3c), then the random component of the
distance score can (and often does) outweigh the true difference
component, causing a potential recognition error. For highly complex
vocabularies (e.g., the letters of the alphabet), this situation occurs
frequently.

One possible solution to the above problem would be to modify the
overall distance computation so that more weight is given to some
regions of the pattern than others. For example, we could consider a
weighted overall distance of the form

$$D(\tilde{T}, \tilde{R}_i) = \frac{\sum_{k=1}^{N} W(k)\, d(\tilde{T}(k), \tilde{R}_i(w(k)))}{\sum_{k=1}^{N} W(k)}, \tag{5}$$

where W(k) is an arbitrary frame weighting function, and the
denominator of eq. (5) is used for distance normalization. The problem
with eq. (5) is that a "good" weighting function is difficult to define
since the "optimal" set of weights is clearly a function of the "actually"
spoken word (q) and the reference pattern being used (i). Furthermore,
any weighting that would help discriminate between acoustically
similar words would tend to hurt the discrimination between acoustically
different words.
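A small numerical sketch may help show why the weighting of eq. (5) matters for similar words. The frame distances below are made up for illustration (they are not data from the paper), and the function name is hypothetical: two words differ only in their first two frames, so a weighting concentrated on those frames enlarges the gap between the correct and the similar reference.

```python
def weighted_distance(frame_distances, weights):
    """Weighted, normalized distance of eq. (5): sum_k W(k) d(k) / sum_k W(k)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * d for w, d in zip(weights, frame_distances)) / total_weight

# Made-up frame distances: the similar word differs only in the first two frames.
d_correct = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
d_similar = [1.5, 1.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

uniform = [1.0] * 8                                       # unweighted case, eq. (3)
front_loaded = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # weight on the dissimilar region

gap_unweighted = (weighted_distance(d_similar, uniform)
                  - weighted_distance(d_correct, uniform))
gap_weighted = (weighted_distance(d_similar, front_loaded)
                - weighted_distance(d_correct, front_loaded))
```

With these numbers the gap grows from 0.25 to 1.0, while the random variation in the six shared frames no longer contributes at all.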

The above discussion suggests that a reasonable approach would be
a two-pass recognition strategy in which the first pass would decide on
an ordering of word "equivalence" classes (sets of acoustically similar
words), and the second pass would order the individual words within
each equivalence class. For the first-pass recognition an unweighted
(normal) distance would be used, and for the second pass a weighted
distance would be used. In order to implement such a two-pass
recognizer, a number of important questions must be answered,
including:

(i) How do we "automatically" choose the word equivalence classes
for each new vocabulary?

(ii) How do we determine class distance scores for the first
recognition pass?

(iii) How do we determine weighting functions for the second
recognition pass?

(iv) How do we generate weighted distance scores for the second
recognition pass?

(v) How do we combine results from both recognition passes to give
a final, overall set of distance scores and word orderings?

Some possible answers to each of these questions are given in the 
following sections. 

2.1 Generation of word equivalence classes

Given the V vocabulary words $v_1, v_2, \ldots, v_V$, we would like to find
a procedure for mapping words into acoustic equivalence classes $\phi_j$,
$j = 1, 2, \ldots, J$, where $J < V$. There are at least two reasonable
approaches for solving this problem; one is a theoretical approach, the
other an experimental one.

For the theoretical approach we can generate a "word-by-word"
distance matrix $D_w$ on the basis of the phonetic transcriptions of the
vocabulary entries. In order to do this we need to define a "phoneme"
distance matrix, $d_p$, a distance cost for inserting a phoneme, $d_I$,
and a distance cost for deleting a phoneme, $d_D$. The phoneme distance
matrix could be a count of the number of distinctive features that have
to be changed to convert from one phoneme to another. 13 A total
word-by-word distance is then defined by a dynamic time-warp match
between the words, with a vertical step representing an insertion, and a
horizontal step representing a deletion. Figure 4a illustrates this
procedure for the words eight and J, and Fig. 4b for the words one and
nine. For the words eight and J, the optimum path is an insertion (of J),
a match between $e^I$ and $e^I$, and a deletion of t, giving a distance

$$d(e^I t,\; J e^I) = \frac{d_I + d_p(e^I, e^I) + d_D}{3}, \tag{6a}$$

whereas for one and nine, the optimum path is a straight line giving

$$d(w\,a\,n,\; n\,a^I n) = \frac{d_p(w, n) + d_p(a, a^I) + d_p(n, n)}{3}. \tag{6b}$$

It should be clear that once $d_p(p_i, p_j)$, $d_I$, and $d_D$ are defined,
the word-by-word distance scores can be generated.
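The phoneme-level alignment described above is a form of weighted edit distance, and a minimal Python sketch follows. Everything named here is an assumption made for illustration: the function name, the phoneme labels ("eI", "t", "dZ"), the toy phoneme distance, and the insertion and deletion costs. The sketch returns the total path cost together with the number of path steps, so that a normalization by path length can be applied if desired.

```python
def word_distance(a, b, d_p, d_ins, d_del):
    """Cheapest alignment of phoneme lists a and b.

    Returns (total_cost, path_steps). A step that consumes only a phoneme
    of b is an insertion, a step that consumes only a phoneme of a is a
    deletion, and a diagonal step is a match/substitution costing d_p.
    """
    n, m = len(a), len(b)
    unreachable = (float("inf"), 0)
    best = [[unreachable] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, 0)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                cost, steps = best[i - 1][j - 1]
                candidates.append((cost + d_p(a[i - 1], b[j - 1]), steps + 1))
            if i > 0:
                cost, steps = best[i - 1][j]
                candidates.append((cost + d_del, steps + 1))
            if j > 0:
                cost, steps = best[i][j - 1]
                candidates.append((cost + d_ins, steps + 1))
            best[i][j] = min(candidates)   # lowest cost, ties broken by fewer steps
    return best[n][m]

def toy_phoneme_distance(p, q):
    # Hypothetical stand-in for a distinctive-feature count.
    return 0.0 if p == q else 3.0

# "eight" against "J", using illustrative phoneme labels.
cost, steps = word_distance(["eI", "t"], ["dZ", "eI"], toy_phoneme_distance, 1.0, 1.0)
```

With these toy costs the cheapest path is the one described in the text (an insertion, a match, and a deletion), giving a total cost of 2.0 over 3 steps.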

A second approach to obtaining word-by-word distance scores is to 
use real tokens of the vocabulary words and do the actual dynamic 
time warping of the feature sets and obtain actual word distances. If 
several tokens have been recorded, averaging of distances increases 
the reliability of the final results. 
Fig. 4 - Examples illustrating "word" alignment based on dynamic time warping.

From the word-by-word distance matrices, word equivalence classes 
may be obtained using the clustering procedures of Levinson et al., 14 
in which the vocabulary words are grouped into clusters (equivalence 
sets) based entirely on pairwise distance scores. 
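To make the idea of grouping words from pairwise distances concrete, the sketch below uses a deliberately simple rule: any two words closer than a threshold end up in the same set. This is a swapped-in stand-in, not the clustering procedure of Levinson et al., and the distances, threshold, and function name are all made up for illustration.

```python
def threshold_clusters(words, dist, threshold):
    """Group words so that any pair closer than `threshold` shares a set."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the set representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if dist(a, b) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    # Largest sets first, then alphabetical, for a stable presentation.
    return sorted(groups.values(), key=lambda g: (-len(g), g))

# Made-up pairwise distances: only these pairs are "acoustically close".
close_pairs = {frozenset(p) for p in [("B", "D"), ("D", "E"), ("F", "S")]}

def toy_distance(a, b):
    return 0.2 if frozenset((a, b)) in close_pairs else 0.9

classes = threshold_clusters(["B", "D", "E", "F", "S", "W"], toy_distance, 0.5)
```

Note that this rule chains: B and E fall in the same set through D even though their direct distance is large, which may or may not be desirable for a real vocabulary.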

As an example of the use of the above techniques, consider the
39-word vocabulary consisting of the 26 letters of the alphabet, the 10
digits, and the 3 command words stop, error, and repeat. These 39
words become clustered into the sets

Class                                                       Tokens
$\phi_1$    = {B, C, D, E, G, P, T, V, Z, 3, repeat}          11
$\phi_2$    = {A, J, K, 8, H}                                  5
$\phi_3$    = {F, S, X, 6}                                     4
$\phi_4$    = {I, Y, 5, 4}                                     4
$\phi_5$    = {Q, U, 2}                                        3
$\phi_6$    = {L, M, N}                                        3
$\phi_7$    = {O}                                              1
$\phi_8$    = {R}                                              1
$\phi_9$    = {W}                                              1
$\phi_{10}$ = {stop}                                           1
$\phi_{11}$ = {error}                                          1
$\phi_{12}$ = {0}                                              1
$\phi_{13}$ = {1}                                              1
$\phi_{14}$ = {7}                                              1
$\phi_{15}$ = {9}                                              1

Here $\phi_7$ contains the letter O and $\phi_{12}$ the digit 0.
We discuss this vocabulary and the resulting equivalence sets in more
detail in Section III.

2.2 Determination of class-distance scores 

Once all the vocabulary words have been assigned to one of the J
classes, the first recognition pass estimates an ordering of the word
classes in terms of class-distance scores. The class-distance scores can
be determined in one of two ways. First, they can be computed as the
minimum of the word-distance scores over all words in the class, i.e.,

$$D(\phi_j) = \min_{v_i \in \phi_j} D(\tilde{T}, \tilde{R}_i), \qquad j = 1, 2, \ldots, J. \tag{7}$$

This computation is similar to the one used by Aldefeld et al. 15 for
directory listing retrieval.
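The minimum rule of eq. (7) is short enough to write out directly. The sketch below is illustrative only (the function and variable names are hypothetical); the word distances and class assignments are the first-pass values of the 12-word example given later in Table I.

```python
def class_distances(word_distance, word_class):
    """Eq. (7): the class distance is the smallest word distance in the class."""
    result = {}
    for word, dist in word_distance.items():
        cls = word_class[word]
        result[cls] = min(dist, result.get(cls, float("inf")))
    return result

# First-pass word distances and class memberships from the example of Table I.
first_pass = {1: 0.47, 2: 0.39, 3: 0.51, 4: 0.72, 5: 0.42, 6: 0.60,
              7: 0.67, 8: 0.83, 9: 0.37, 10: 0.78, 11: 0.66, 12: 0.62}
word_class = {1: 1, 2: 3, 3: 3, 4: 2, 5: 3, 6: 1,
              7: 1, 8: 2, 9: 3, 10: 2, 11: 2, 12: 1}

per_class = class_distances(first_pass, word_class)
class_order = sorted(per_class, key=per_class.get)   # best class first
```

For this example the classes come out in the order 3, 1, 2, with class distances 0.37, 0.47, and 0.66.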

An alternative method of obtaining class-distance scores would be
to obtain "class-reference" templates (as well as word-reference
templates) and to measure distance directly from the class-reference
templates. Clearly with multiple templates per class, the K-nearest
neighbor (KNN) rule can be used as effectively for class templates as
for word templates.

The reason for considering class-reference templates for obtaining 
the class-distance scores is that the number of word classes is clearly 
smaller than the number of words. Hence, the number of distance 
calculations required to establish class-distance scores is generally 
much lower for class templates than for word templates. For example, 
for the 39-word vocabulary discussed previously, there are 15 word 
classes. Hence there is almost a 3 to 1 reduction from words to word 
classes. However, it should be clear that the danger in using class 
templates is that errors in determining class distances can be made 
from the reduced number of templates. This point will be discussed 
later in this paper. 

2.3 Choice of weighting functions for the second pass of recognition 

The output of the first recognition pass is an ordered set of word 
class-distance scores. For the second recognition pass, all words within 
the top class (or classes) are compared to the unknown test-word 
pattern (T) using a weighted distance of the type discussed in eq. (5), 
and an ordering of words within the class is made. If several classes 
have similar class distance scores, the words within each of these 
classes are ordered in the same manner. 

The key question that remains is how to choose the weighting
function, W(k), of eq. (5) in an optimal or reasonable manner. The
reader should recall, at this point, that the optimal weighting function,
W(k), is assumed to be a function of the pair of indices i (the reference
word) and j (the proposed test word). Hence if there are L words in an
equivalence class, then there are $L(L - 1)$ sets of weighting functions
[the cases i = j have W(k) = 1].

Fig. 5 - Simple Gaussian model for frame distance distributions.

We have investigated two ways of determining W(k) for the second
recognition pass. Optimality theory says that to maximize the weighted
distance of eq. (5), 16 the value of W(k) should be

$$W(k) = 1, \qquad k = \hat{k}, \tag{8a}$$
$$W(k) = 0, \qquad \text{all other } k, \tag{8b}$$

where $\hat{k}$ is the index where the distance between $\tilde{R}_i$ and
$\tilde{T}$ is, on average, the maximum. In this manner, the algorithm
places all its reliance on the single frame where one would expect the
maximum difference between reference and test patterns to occur. In
practice, this weighting does not work since the variability in location
of the frame $k = \hat{k}$ of eq. (8) is large. Hence, over several trials
the distances obtained using the weighting of eq. (8) can vary
considerably.

A more effective manner of determining a good (but not optimal)
set of weights is as follows. Consider the model for the distribution of
distances for a single frame as shown in Fig. 5. The curve on the left
in Fig. 5 is the assumed distribution of distances in the case when
i = j (i.e., the reference and test patterns are from the same word). In
this case, we expect a $\chi^2$ distribution with p (order of the LPC
model) degrees of freedom for the frame distance. For convenience, we
model this distribution as a Gaussian distribution with mean $m_1$ and
standard deviation $\sigma_1$.*

For the case when $i \neq j$ (i.e., the reference and test patterns are
from different words), we assume the frame distance has a Gaussian
distribution (as shown to the right in Fig. 5) with mean $m_2$ and
standard deviation $\sigma_2$.

* This assumption is reasonable since the word distance, which is a sum of frame
distances, has a Gaussian distribution (by the central limit theorem), and the actual
probability of word error is directly related to the word distance.

We now make a simple recognition model that says the probability
of recognition error for the word is proportional to the probability of
error for single frames (since the word distance is the sum of frame
distances). Then, based on the model of Fig. 5 with assumed Gaussian
statistics, the probability of correct classification (i.e., finding a smaller
frame distance for the spoken word than for any other word) for a
single frame is

$$P(C) = \int_{-\infty}^{\infty} P[p(d \mid i = j) = \lambda] \cdot P[p(d \mid i \neq j) > \lambda] \, d\lambda, \tag{9}$$

where P[x] is the probability of the event x occurring. Equation (9)
says that the probability of correct frame classification is the integral
of the probability that for the correct word (i = j) we get a frame
distance $\lambda$, and for the closest incorrect word ($i \neq j$) we get a
frame distance greater than $\lambda$. Thus the probability of a frame
error is

$$P(E) = 1 - P(C), \tag{10}$$

which becomes

$$P(E) = 1 - \int_{-\infty}^{\infty} \int_{\lambda}^{\infty} N[\lambda - m_1, \sigma_1] \, N[\eta - m_2, \sigma_2] \, d\eta \, d\lambda, \tag{11}$$

which can be put into the form

$$P(E) = \int_{-\infty}^{-(m_2 - m_1)/(\sigma_1^2 + \sigma_2^2)^{1/2}} \frac{1}{\sqrt{2\pi}} \exp(-x^2/2) \, dx = \mathrm{Erf}\left[\frac{m_2 - m_1}{(\sigma_1^2 + \sigma_2^2)^{1/2}}\right]. \tag{12}$$

The form of eq. (12) can be verified for the simple cases $m_2 = m_1$,
where P(E) = 0.5, and $m_2 \gg m_1$, where $P(E) \to 0$.
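The two limiting cases quoted for eq. (12) can be checked numerically. The sketch below assumes the Gaussian model of Fig. 5 and evaluates the lower-tail integral of eq. (12) with the complementary error function from the Python standard library; the function name and the numerical values are illustrative only.

```python
import math

def frame_error_probability(m1, sigma1, m2, sigma2):
    """Lower-tail Gaussian integral of eq. (12).

    The upper limit of integration is -(m2 - m1) / sqrt(sigma1^2 + sigma2^2),
    and the standard normal lower tail at -z equals 0.5 * erfc(z / sqrt(2)).
    """
    z = (m2 - m1) / math.sqrt(sigma1 ** 2 + sigma2 ** 2)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

equal_means = frame_error_probability(1.0, 1.0, 1.0, 1.0)       # m2 = m1
well_separated = frame_error_probability(1.0, 0.5, 10.0, 0.5)   # m2 >> m1
```

The error probability depends only on the ratio of the mean separation to the combined spread, which is the quantity the frame weighting of the next equation is built from.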

The above discussion suggests that a reasonable choice for frame
weighting would be

$$W^{j,i}(k) = \frac{\left| \langle d_{ii}(k) \rangle - \langle d_{ji}(k) \rangle \right|}{\left( \sigma_{d_{ii}}^2(k) + \sigma_{d_{ji}}^2(k) \right)^{1/2}}, \tag{13}$$

where $d_{ii}(k)$ is the local distance between repetitions of word i for
frame k, and $d_{ji}(k)$ is the local distance between spoken words j and i
for frame k, and where the expectations are performed statistically
over a large number of occurrences of the words $v_i$ and $v_j$.
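As a rough sketch of how such weights might be estimated from training comparisons, the code below computes, frame by frame, the separation of the two mean distance curves divided by their combined standard deviation, following the ratio that appears in eq. (12). The function name and the tiny data set are made up for illustration, and taking the magnitude of the mean difference is an assumption of this sketch.

```python
import math
from statistics import mean, pvariance

def frame_weights(same_word, cross_word):
    """Per-frame weights from time-aligned distance curves.

    same_word[r][k]  : frame distances d_ii(k) for repetition r of word i
    cross_word[r][k] : frame distances d_ji(k) for spoken word j against word i
    """
    n_frames = len(same_word[0])
    weights = []
    for k in range(n_frames):
        d_ii = [curve[k] for curve in same_word]
        d_ji = [curve[k] for curve in cross_word]
        spread = math.sqrt(pvariance(d_ii) + pvariance(d_ji))
        separation = abs(mean(d_ji) - mean(d_ii))
        weights.append(separation / spread if spread > 0 else 0.0)
    return weights

# Made-up curves: the two words differ only in frame 0.
same = [[0.4, 0.4], [0.6, 0.6]]
cross = [[1.4, 0.4], [1.6, 0.6]]
w = frame_weights(same, cross)
```

Here the weight is large in frame 0, where the words differ, and zero in frame 1, where the mean distances coincide.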

748 THE BELL SYSTEM TECHNICAL JOURNAL, MAY-JUNE 1981

Fig. 6 - Examples of frame-by-frame distances for words within word equivalence
classes. [Panels (a), (b), and (c); panel (c) is labeled "A, J, K, 8 vs A".
Horizontal axis: frame number (k), 1 to 40.]

By way of example, Fig. 6 shows plots of $\langle d_{ji}(k) \rangle$ versus
k and W(k) versus k for some typical cases.* The plots cover the
following cases:

(i) (Fig. 6a) Curves of $\langle d_{ji}(k) \rangle$ and $\sigma_{d_{ji}}(k)$
for the case where word i was the letter I, and word j was the letter Y.
We can see that $\langle d_{ii}(k) \rangle$ (the solid curves) is
approximately constant whereas $\langle d_{ji}(k) \rangle$ differs from
$\langle d_{ii}(k) \rangle$ only at the beginning of the word (i.e., the
first eight frames). We also see that the curves of $\sigma_{d_{ji}}(k)$
(the dashed curves) are comparable for the cases j = i and $j \neq i$,
with only small differences occurring in the first eight frames.

(ii) (Fig. 6b) Curves of $\langle d_{ji}(k) \rangle$ and $\sigma_{d_{ji}}(k)$
for the case where word i was the letter A, and where j corresponded to
the letters J and K and the word 8. Similar behavior to that of Fig. 6a is
seen, in that $\langle d_{ii}(k) \rangle$ is approximately constant, and
$\langle d_{ji}(k) \rangle$ is larger than $\langle d_{ii}(k) \rangle$ at the
beginning of the word, for words J and K, and at the end of the word,
for word 8. For the word 8, the curve of $\sigma_{d_{ji}}(k)$ is also fairly
large at the end of the word, indicating the high degree of variability in
the plosive release of the word 8.

(iii) (Fig. 6c) This part shows the results of averaging the data of
Fig. 6b over all $j \neq i$ with j in the class of word i, i.e.,
class-weighting templates. In this case the curve of
$\langle d_{ji}(k) \rangle$ shows flat behavior except at the beginning (due
to J, K) and end (due to 8). If storage of word-weighting curves is
burdensome, the use of class-weighting curves could be considered as a
viable alternative.

* The data of Fig. 6 were obtained from about 10,000 comparisons for each word, i.e.,
a large data base was used.

Figure 7 shows a set of two weighting curves $W^{j,i}(k)$ for the words
I and Y. Figure 7a shows the weighting curve for reference word I and
test word Y, and Fig. 7b shows the weighting curve for reference word
Y and test word I. Several interesting properties of the curves should
be noted. First we see that $W^{j,i}(k)$ generally consists of a large pulse
(for these examples this occurs near k = 1) and a residual tail. The tail
is a measure of the statistical noise level, i.e., the statistical difference
between $\langle d_{ji}(k) \rangle$ and $\langle d_{ii}(k) \rangle$ in the
region of acoustical similarity. Typically the peak amplitude in the tails
is less than 10 percent of the peak amplitude in the main pulse.

Another interesting property of the weighting curves is that there is
no symmetry, in that

$$W^{j,i}(k) \neq W^{i,j}(k). \tag{14}$$

An explanation of this behavior is given in Fig. 8, which shows two
plots of dynamic time-warping paths for the words I and Y, where it
is assumed that the word Y is simply the word I with a prefix phoneme
/w/. Figure 8a shows that when I is warped to Y, there is a discrepancy
region in which the /w/ is being warped to the initial region of the
/$a^I$/ and large distances result. The /$a^I$/ is then warped to itself
(the "ideal" path) and no further distance is accumulated. Figure 8b
shows that the discrepancy region is considerably smaller when Y is
mapped to I. The resulting weighting curves agree in form with the
results given in Fig. 7.

2.4 Generation of distance scores for the second recognition pass 

We have now shown how to assign words to classes, how to get class
distance scores for the first recognition pass, and how to assign weights
for pairs of words within a word class. The next step in the procedure
is the determination of the distance for the second recognition pass
based on the pairwise weighted distance scores.

Fig. 7 - Weighting curves for comparing the words /I/ and /Y/. [Panel shown:
reference I, test Y. Horizontal axis: frame number (k).]

To see how this is accomplished, we define a pairwise weighted
distance $D_{j,i}$ as

$$D_{j,i} = \frac{\sum_{k=1}^{N} W^{j,i}(k) \, d_i(k)}{\sum_{k=1}^{N} W^{j,i}(k)}, \tag{15}$$

where i is the index of the reference pattern (i.e., one of the words in
the equivalence class) and j is the (assumed) index of the test pattern
(again one of the words in the equivalence class).

Fig. 8 - An example showing why the word weighting curves are not symmetrical.

The quantity $D_{j,i}$ of eq. (15) is computed for all i, j pairs (with
$i \neq j$) in the word class with minimum class distance, and a matrix of
pairwise distances is obtained. The word distance, $D_i$, can be obtained
in one of two ways, namely:

(i) Averaging over the j index, giving

$$D_i = \frac{1}{L - 1} \sum_{j \neq i} D_{j,i}. \tag{16a}$$

(ii) Finding the minimum over the j index, i.e.,

$$D_i = \min_{j \neq i} \{ D_{j,i} \}. \tag{16b}$$

The advantage of averaging is that $D_i$ tends to be more reliable, since
averaging is equivalent to adding weighted distances over a larger
number of frames than would be used for a single comparison. The
minimum computation is useful, especially when several of the $D_{j,i}$
are about the same. We examine both these scoring methods in Section
III.
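The two scoring rules of eqs. (16a) and (16b) can be compared on the pairwise distances listed for class 3 in the worked example of Table II. The sketch below is illustrative (the function and variable names are hypothetical); `pairwise[i][j]` holds $D_{j,i}$ for reference word i and assumed test word j.

```python
def word_distances(pairwise, how="avg"):
    """Collapse pairwise distances D_{j,i} into one score D_i per reference word.

    how="avg" follows eq. (16a); how="min" follows eq. (16b).
    """
    result = {}
    for i, row in pairwise.items():
        values = list(row.values())
        result[i] = sum(values) / len(values) if how == "avg" else min(values)
    return result

# Class 3 of Table II: words 2, 3, 5, and 9.
class3 = {
    2: {3: 0.33, 5: 0.25, 9: 0.28},
    3: {2: 0.47, 5: 0.67, 9: 0.50},
    5: {2: 0.45, 3: 0.56, 9: 0.57},
    9: {2: 0.27, 3: 0.37, 5: 0.30},
}

by_average = word_distances(class3, "avg")
by_minimum = word_distances(class3, "min")
order_by_average = sorted(by_average, key=by_average.get)
```

Rounded to two places, the averages are 0.29, 0.55, 0.53, and 0.31 for words 2, 3, 5, and 9, which reproduces the averaged column and the within-class order of Table II.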

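For the case of averaging pairwise distance scores [eq. (16a)], the
computation can be carried out more efficiently as follows. By
combining eqs. (15) and (16a) we get

$$D_i = \frac{1}{L - 1} \sum_{j \neq i} \left[ \frac{\sum_{k=1}^{N} W^{j,i}(k) \, d_i(k)}{\sum_{k=1}^{N} W^{j,i}(k)} \right] \tag{17a}$$

$$= \frac{1}{L - 1} \sum_{j \neq i} \sum_{k=1}^{N} \left[ \frac{W^{j,i}(k)}{\sum_{k'=1}^{N} W^{j,i}(k')} \right] d_i(k) \tag{17b}$$

$$= \sum_{k=1}^{N} \left[ \frac{1}{L - 1} \sum_{j \neq i} \frac{W^{j,i}(k)}{\sum_{k'=1}^{N} W^{j,i}(k')} \right] d_i(k) \tag{17c}$$

$$= \sum_{k=1}^{N} W^{i}(k) \, d_i(k), \tag{17d}$$

where

$$W^{i}(k) = \frac{1}{L - 1} \sum_{j \neq i} \frac{W^{j,i}(k)}{\sum_{k'=1}^{N} W^{j,i}(k')}. \tag{18}$$

Thus, for L words in the equivalence class, we can compute $D_i$ with N
multiplications and additions [rather than the $N(L - 1)$ computations
of eq. (16a)], and only L vectors of N averaged weights [$W^{i}(k)$] need
be stored, rather than $L(L - 1)$ vectors as implied by eq. (15).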
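The saving described above rests on the fact that averaging the normalized weight curves first gives the same word distance as averaging the pairwise distances afterwards. A short check with made-up numbers (the weight curves, frame distances, and function names below are illustrative assumptions) confirms the identity:

```python
def normalized(curve):
    """Scale a weight curve so that it sums to one."""
    total = sum(curve)
    return [w / total for w in curve]

def averaged_weights(curves):
    """Average the normalized curves W^{j,i}(k) over j, as in eq. (18)."""
    norm = [normalized(c) for c in curves]
    return [sum(column) / len(norm) for column in zip(*norm)]

# Made-up data: frame distances d_i(k) and two weight curves for j != i.
d_i = [0.5, 0.4, 0.9, 0.3]
curves = [[4.0, 1.0, 0.0, 1.0], [0.0, 2.0, 2.0, 0.0]]

# Route 1: pairwise weighted distances, then their average.
pairwise = [sum(w * d for w, d in zip(c, d_i)) / sum(c) for c in curves]
via_pairwise = sum(pairwise) / len(pairwise)

# Route 2: one precomputed averaged weight curve, then a single weighted sum.
w_avg = averaged_weights(curves)
via_averaged = sum(w * d for w, d in zip(w_avg, d_i))
```

Both routes give 0.55 for these numbers; the second needs only the single stored curve per reference word at recognition time.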

Another variation on the distance weighting that was studied here
was the effect of applying a nonlinearity to the weighting function,
$W^{j,i}$, before computing $D_{j,i}$. The nonlinearity was to replace
$W^{j,i}(k)$ by $\hat{W}^{j,i}(k)$, defined as

$$\hat{W}^{j,i}(k) = \begin{cases} W^{j,i}(k), & W^{j,i}(k) / W_{\mathrm{MAX}} \geq T, \\ 0, & \text{otherwise}, \end{cases} \tag{19}$$

where

$$W_{\mathrm{MAX}} = \max_{k} \left[ W^{j,i}(k) \right], \tag{20}$$

and T is a threshold which is specified in the algorithm. The
nonlinearity of eq. (19) truncates (to 0) the weighting curve whenever
its relative amplitude falls below the threshold. Figure 9 illustrates a
typical curve $W^{j,i}(k)$ and its truncated version $\hat{W}^{j,i}(k)$. The
new weighting function was then applied directly in eq. (15) in place of
$W^{j,i}(k)$. Clearly, when T = 0, $\hat{W}^{j,i}(k)$ and $W^{j,i}(k)$ are
identical. Again, when averaging is used, the computation of eq. (17)
gives a reduced set of weights.
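The truncation step is simple to state in code. The sketch below assumes nonnegative weights and zeroes every frame whose weight, relative to the peak, falls below the threshold; the function name and the sample curve are made up for illustration.

```python
def truncate_weights(weights, threshold):
    """Zero every weight whose amplitude relative to the peak is below `threshold`."""
    w_max = max(weights)
    return [w if w >= threshold * w_max else 0.0 for w in weights]

# Made-up weighting curve: one large pulse followed by a low-level tail.
curve = [10.0, 0.5, 2.0, 0.9]
truncated = truncate_weights(curve, 0.1)   # keeps only frames at >= 10% of the peak
unchanged = truncate_weights(curve, 0.0)   # T = 0 leaves the curve as it was
```

Raising the threshold removes more of the tail, which corresponds to relying only on the frames where the two words are expected to differ.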
Fig. 9 - An example of a weighting curve and its truncated version. [Panels (a)
and (b); horizontal axis: frame number (k), up to 40.]



2.5 Overall distance computation 

If we can assume that the probability of a class error on the first
recognition pass is significantly smaller than the probability of a word
error on the first pass, then the final distance for each word of the
minimum class is the distance obtained on the second recognition
pass. However, there are applications in which it is desirable to have a
distance score for every word in the vocabulary. In these cases, it is
necessary to combine the ordering from the second pass with the
distances from the first pass. The basis for such a strategy is that
distances on the first pass are statistically more reliable than distances
on the second pass, whereas order statistics (within the class) are more
reliable on the second pass than on the first pass. One very simple way
of combining distances and word orders is to obtain second-pass
ordering for every word in the vocabulary (i.e., apply the method of
Section 2.4 to all word classes), and then reorder the word list using
distances from the first pass, and ordering within the class from the
second pass.

2.6 An example of the use of the two-pass system 

To illustrate this entire procedure, Tables I to III show an example
of the recognition steps for a 12-word vocabulary with three word
equivalence classes. Table I shows the results of the first recognition
pass. The class distance scores are assigned as the minimum word
distance for words within the class. The "best" class in the first pass
is class 3 with a distance score of 0.37, with class 1 having a somewhat
higher distance of 0.47. In the second recognition pass the words within
the best class (or classes) are compared using the optimally determined
weighting functions. The results for each of the three classes are shown
in Table II. In practice, one would usually need to compute the $D_{j,i}$
scores only for the best one or two classes. However, for explanatory
purposes, results are shown for all three classes. Also, as discussed
above, in the case of distance averaging, the $D_{j,i}$ scores need not be
computed since the $D_i$ scores can be obtained directly via eqs. (17) and
(18). Using the technique of averaging leads to the within-class
distances and orderings as shown in the table. Finally, Table III shows
the results of reordering the words using the distances obtained from
pass 1, and the within-class orderings obtained from pass 2. Thus word
2 is the best recognition candidate (with a distance of 0.37), whereas
word 9 was the best recognition candidate at the end of the first pass.
Other within-class reshufflings of word position occur as a result of
the two recognition passes, as shown in Table III.

Table I - Recognition results for a simple example (first pass)

Word     Word     Word Distance    Word Position    |  Class     Class Distance
Index    Class    First Pass       First Pass       |  Number    First Pass
  1        1          0.47              4           |    1           0.47
  2        3          0.39              2           |    2           0.66
  3        3          0.51              5           |    3           0.37
  4        2          0.72             10           |
  5        3          0.42              3           |
  6        1          0.60              6           |
  7        1          0.67              9           |
  8        2          0.83             12           |
  9        3          0.37              1           |
 10        2          0.78             11           |
 11        2          0.66              8           |
 12        1          0.62              7           |

Table II - Second recognition pass results for the example in Table I
(entries are $D_{j,i}$ for reference word i and assumed test word j; X marks i = j)

Class 1
  i        j = 1    j = 6    j = 7    j = 12    $D_i$ (avg)    Order
  1          X      0.43     0.52     0.47        0.47           1
  6        0.57       X      0.62     0.62        0.60           3
  7        0.72     0.75       X      0.60        0.69           4
 12        0.60     0.57     0.63       X         0.60           2

Class 2
  i        j = 4    j = 8    j = 10   j = 11    $D_i$ (avg)    Order
  4          X      0.87     0.82     0.85        0.85           3
  8        0.80       X      0.84     0.86        0.83           2
 10        0.92     0.77       X      0.91        0.87           4
 11        0.78     0.80     0.80       X         0.79           1

Class 3
  i        j = 2    j = 3    j = 5    j = 9     $D_i$ (avg)    Order
  2          X      0.33     0.25     0.28        0.29           1
  3        0.47       X      0.67     0.50        0.55           4
  5        0.45     0.56       X      0.57        0.53           3
  9        0.27     0.37     0.30       X         0.31           2
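The reordering step of this example can be reproduced with a short sketch. Within each class, the first-pass distances are sorted and handed out to the class members in their second-pass order; the whole vocabulary is then ranked by the resulting distances. The function and variable names are hypothetical, and the data are the first-pass distances of Table I and the within-class orders of Table II.

```python
def combine_passes(first_pass, word_class, second_order):
    """Keep first-pass distance values, but assign them within each class
    according to the second-pass ordering. Returns (distance, position) maps."""
    members_of = {}
    for word, cls in word_class.items():
        members_of.setdefault(cls, []).append(word)

    final = {}
    for members in members_of.values():
        distances = sorted(first_pass[w] for w in members)
        ranked = sorted(members, key=lambda w: second_order[w])
        for word, dist in zip(ranked, distances):
            final[word] = dist

    overall = sorted(final, key=final.get)
    position = {word: rank for rank, word in enumerate(overall, start=1)}
    return final, position

first_pass = {1: 0.47, 2: 0.39, 3: 0.51, 4: 0.72, 5: 0.42, 6: 0.60,
              7: 0.67, 8: 0.83, 9: 0.37, 10: 0.78, 11: 0.66, 12: 0.62}
word_class = {1: 1, 2: 3, 3: 3, 4: 2, 5: 3, 6: 1,
              7: 1, 8: 2, 9: 3, 10: 2, 11: 2, 12: 1}
# Within-class order from the second pass (the "Order" entries of Table II).
second_order = {1: 1, 6: 3, 7: 4, 12: 2,
                4: 3, 8: 2, 10: 4, 11: 1,
                2: 1, 3: 4, 5: 3, 9: 2}

final_distance, final_position = combine_passes(first_pass, word_class, second_order)
```

Run on these values, the sketch gives word 2 the distance 0.37 and position 1, and word 9 the distance 0.39 and position 2, in agreement with Table III.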

2.7 Summary of the two-pass recognizer

Figure 10 shows a block diagram of the full two-pass isolated word
recognition system. In the first pass a DTW distance is computed
between the unknown test word and the reference templates for each
word class. The outputs of the first pass are ordered sets of word
distance scores and class distance scores.

For the second pass a set of pairwise weighted distances is
determined for all words within each word class with suitably low scores
on the first recognition pass. The final recognition output is a
combination of distance scores from the first pass and word orderings
from the second pass. In the next section we demonstrate how this
procedure works in some practical recognition examples.

Table III - Overall word positions and distances for the example given in
Tables I and II

Word     Word        Word
Index    Position    Distance
  1         4          0.47
  2         1          0.37
  3         5          0.51
  4        11          0.78
  5         3          0.42
  6         7          0.62
  7         9          0.67
  8        10          0.72
  9         2          0.39
 10        12          0.83
 11         8          0.66
 12         6          0.60
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OR CLASS 
TEMPLATES 



TEST WORD 



DTW 
DISTANCE 



FRAME-BY-FRAME 

DISTANCE SCORES 

d/lkl 



DETERMINE 
CLASS 

DISTANCES 

AND 

CLASS 

ORDER 



WEIGHTS 

FILE 

W'-' (k) 



WEIGHTED 

DISTANCE 

COMPUTATION 



REORDER 

LIST OF 

WORDS, 

DISTANCES 



FIRSTPASS ■*• I SECONDPASS 

Fig. 10 — Block diagram of the overall two-pass recognizer. 



III. EVALUATION OF THE TWO-PASS RECOGNIZER 

To test the ideas behind the two-pass recognizer, we used a data 
base of existing recordings. The vocabulary consisted of V = 39 words:
the letters of the alphabet, the digits (0 to 9), and the three
command words stop, error, and repeat. The
training data for obtaining word and class reference templates, and 
pairwise word weighting curves, consisted of one replication of each 
word by each of 100 talkers (50 men, 50 women).* The word reference 
templates (12 per word) were obtained from a clustering analysis of 
the training data. 14,6 A set of "class" reference templates (12 per class) 
was obtained from a second clustering analysis in which the words 
within a class were combined prior to the clustering. The pairwise 
word weighting curves were obtained by cross-comparing all word 
tokens within a word class, averaging the time-aligned distance curves, 
and computing both the averages and standard deviations for each 
frame. 
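
The per-frame statistics mentioned above can be illustrated with a short Python sketch. It is a hedged example only: the function name `frame_statistics` is hypothetical, the distance curves are assumed to be already time-aligned to a common number of frames, and the population standard deviation is an assumption (the text does not say which estimator was used). The weighting formula itself, eq. (13), is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def frame_statistics(
    curves: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return per-frame means and standard deviations of distance curves.

    curves: one time-aligned distance curve per token comparison; every
        curve must have the same number of frames.
    """
    if not curves:
        raise ValueError("at least one curve is required")
    n_frames = len(curves[0])
    if any(len(curve) != n_frames for curve in curves):
        raise ValueError("all curves must be time-aligned to the same length")
    # zip(*curves) groups the k-th frame of every curve together.
    frames = list(zip(*curves))
    means = [mean(frame) for frame in frames]
    # Assumption: population standard deviation (divide by the curve count).
    stds = [pstdev(frame) for frame in frames]
    return means, stds


# Two toy curves of two frames each.
means, stds = frame_statistics([[1.0, 2.0], [3.0, 2.0]])
print(means, stds)
```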

To test the performance of the overall system, two test sets of data 
were used. These included: 

1. TS1: 10 talkers (not used in the training) spoke the vocabulary
one time over a dialed-up telephone line.

2. TS2: 10 talkers (included in the training) spoke the vocabulary
one time over a dialed-up telephone line.

Two sets of performance statistics were measured. For the first 
recognition pass the ability of the recognizer to determine the correct 
word class was measured. For the second recognition pass the improve- 
ment in word recognition accuracy (over the standard one-pass ap- 
proach) was measured. The results obtained are presented in the next 
two sections. 



* All results presented here are for speaker independent systems. 
Fig. 11 — Plots of class accuracy as a function of the number of templates per word
(Q), class position (C), and knn rule (knn) for a 15-class vocabulary.



3.1 Class recognition accuracy for the first pass

The ability of the recognizer to determine the "correct" word class 
of the spoken word was measured using both word templates (and 
obtaining class-distance scores from the word-distance scores as dis- 
cussed previously), and class templates (obtaining class-distance scores 
directly). The number of templates per word (or per class) was varied
from 1 to 12 in the tests to see the effect of the number of reference
templates on the class accuracy. The K-nearest neighbor (knn) rule
was used to measure class scores with values of knn = 1 (minimum
distance), knn = 2 (average of two best scores), and knn = Q (average
of Q best scores), where Q was the total number of templates used per
word (or per class).
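
As an illustration of the knn rule just described, the following Python sketch averages the knn smallest template distances. The function name `knn_score` and the sample distances are hypothetical; how word scores are turned into class scores is covered earlier in the paper and is not reproduced here.

```python
from typing import Sequence


def knn_score(template_distances: Sequence[float], knn: int) -> float:
    """Average of the knn smallest distances over a word's (or class's) templates.

    knn = 1 gives the minimum-distance rule; knn = Q (the number of
    templates) averages over all templates.
    """
    if not 1 <= knn <= len(template_distances):
        raise ValueError("knn must lie between 1 and the number of templates")
    best = sorted(template_distances)[:knn]
    return sum(best) / knn


# Hypothetical distances of a test word to Q = 4 templates of one word.
distances = [0.5, 0.3, 0.9, 0.4]
for k in (1, 2, len(distances)):
    print(k, knn_score(distances, k))
```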

The results of the class recognition accuracy tests are given in Figs.
11 and 12.* Figures 11 and 12 show plots of class error rate (based on
the top C classes) as a function of the number of templates per word 
(Fig. 11) or templates per class (Fig. 12), for values of knn = 1 and 2, 
and for C = 1 (top candidate), C = 2 (two best classes), and C = 3 
(three best classes). Figure 11 shows results when each class is repre- 
sented by word templates, and Fig. 12 shows results when each class 
is represented by class templates. 



* The reader should note the difference in vertical scales between Figs. 11 and 12. 
758 THE BELL SYSTEM TECHNICAL JOURNAL, MAY-JUNE 1981 



Several interesting observations can be made from Figs. 11 and 12. 
These include: 

(i) The knn = 1 rule performs consistently better than the knn = 2
rule for class discrimination, for all values of C and Q. This result
contradicts the results of Rabiner et al. 6 who found significantly
better performance for knn = 2 than for knn = 1. The explanation of
this behavior is that the knn = 2 rule provides significantly improved
within-class discrimination (at the expense of slightly worse
between-class discrimination), and that when the only function is to
determine the class, the knn = 1 rule is superior. In fact, when the
knn rule was used with a value of knn = Q (i.e., averaging over all Q
reference templates), the class accuracy on the first candidate
decreased by about 20 percent, a highly significant loss of accuracy.
This result again demonstrates that the minimum distance rule
(knn = 1) is best for class discrimination.

(ii) The use of word-reference templates provides significantly better 
performance than obtained from class-reference templates. For ex- 
ample, the class error rate for the top three classes (C = 3) with Q = 
4 templates per word is essentially 0; whereas the class error rate for 
the top three classes with four templates per class is about 4 percent. 
This result shows clearly the importance of representing each word in
the equivalence class by an adequate number of word-reference
templates.

Fig. 12 — Plots of class accuracy as a function of the number of templates per class
(Q), class position (C), and knn rule (knn) for a 15-class vocabulary.

(iii) With six templates per word, error rates of about 4 percent
(C = 1), 1 percent (C = 2), and 0 percent (C = 3) are obtainable,
indicating that the full contingent of 12 templates per word is
unnecessary for proper class determination. Using 6 rather than 12
templates per word reduces the computation in the first recognition
pass by 50 percent. If we always use two or more word classes, the
required number of templates per word for the first pass can be reduced
to four, with no serious loss in class accuracy.

The results shown in Fig. 11 indicate that high accuracy can readily 
be achieved in determining the correct equivalence class for each word 
in a very complex vocabulary. Hence there would appear to be no 
problems in implementing the first pass of the recognition system. 

3.2 Within-class word discrimination for the second pass and overall 
performance scores 

The two-pass word recognizer was tested on the words of TS1 and
TS2. For each test set a total of 390 words were used (39 words × 10
talkers). For TS1, the word recognition accuracy (for the best
candidate) on the first pass was 78 percent, and for TS2 (with talkers
from the training set) the word recognition accuracy on the first pass
was 85 percent. At the output of the second pass, the word recognition
accuracy for the best candidate [using the averaging technique of eq.
(16a) and assuming the correct word equivalence class was found] was
84.6 percent for TS1 and 88.5 percent for TS2, representing potential
improvements of 6.6 percent and 3.5 percent, respectively. The
improvement was larger for TS1 than for TS2 because the first-pass
accuracy was lower for TS1 than for TS2 (where the talkers were in the
training set), and hence there was more room for improvement within the
word classes.

Fig. 13 — Percentage improvement, decrease, and the resulting difference in word
position at the output of the second recognition pass for TS1 data as a function of the
distance threshold using the averaging method.

Figures 13 and 14 show plots of the changes in accuracy that are
obtained for TS1 (Fig. 13) and TS2 (Fig. 14) data when a threshold is
imposed on the distance scores at the output of the first recognition
pass. The threshold specifies that the second recognition pass is
skipped if the distance of the second word candidate exceeds the
distance of the first word candidate by more than the threshold.
Clearly this procedure is strictly a computational saving, since a low
distance score for a single word on the first pass is a highly reliable
indicator that no second pass is necessary. The data plotted in Figs.
13 and 14 show the percentage of cases where the actual spoken word
comes in a lower position on the second pass than on the first pass
within the word class, the percentage of cases where the spoken word
comes in a higher position on the second pass than on the first pass,
and the difference (the improvement) between the two curves. All the
results are plotted as a function of the distance threshold for
performing the second-pass computation. It can be seen from these
figures that the two-pass recognizer is not ideal, i.e., there is a
significant fraction of words for which a worse position results at the
output of the second pass. However, on balance, a real improvement in
recognition accuracy results, and it is this improvement that makes
the procedure a viable one.
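
The threshold test for skipping the second pass can be written as a small Python sketch. The function name `needs_second_pass` and the sample scores are hypothetical; the rule itself follows the description above, in which the second pass is skipped when the second-best first-pass distance exceeds the best one by more than the threshold.

```python
from typing import Sequence


def needs_second_pass(first_pass_distances: Sequence[float], threshold: float) -> bool:
    """Decide whether the second recognition pass should be run.

    The second pass is skipped when the gap between the two best
    first-pass word distances is larger than the threshold.
    """
    ordered = sorted(first_pass_distances)
    if len(ordered) < 2:
        return False  # a single candidate leaves nothing to reorder
    gap = ordered[1] - ordered[0]
    return gap <= threshold


# Hypothetical first-pass scores and a threshold of 0.04.
print(needs_second_pass([0.37, 0.39, 0.60], 0.04))  # close scores
print(needs_second_pass([0.30, 0.45, 0.60], 0.04))  # clear winner
```

A threshold of zero skips the second pass in nearly every case, while a very large threshold always runs it, so the threshold trades computation against the potential accuracy gain.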

A similar set of results, obtained using the minimum computation of
eq. (16b) on the second pass rather than the average computation of
eq. (16a), is shown in Figs. 15 and 16 for TS1 and TS2, respectively.
These plots show the same information as those of Figs. 13 and 14 for
the averaging procedure. A comparison of these results shows that the
averaging computation performs as well as, or better than, the
minimum computation for the whole range of distance thresholds, and for
both data sets. These results indicate that the averaging method
provides a small but important statistical stability to the computation.

Fig. 14 — Percentage improvement, decrease, and the resulting difference in word
position at the output of the second recognition pass for TS2 data as a function of the
distance threshold using the averaging method.

3.3 The effect of thresholding on the weighting curves

We ran a series of tests with the data from TS1 and TS2 to investigate
the effects of applying thresholds to the weighting curves as illustrated
in Fig. 9. The results indicated that poorer performance always
resulted when any significant part of the weighting curve was zeroed out.
Thus the gain achieved by removing the "statistical" low-level parts of
the weighting curve was canceled by the "deterministic" loss from the
rest of the weighting curve. Hence the conclusion was to use the entire
weighting curve as derived from the statistical model.

3.4 Computation for the two-pass recognizer 

We have seen in Section 3.2 that word recognition accuracy
improvements of 3.5 to 6.6 percent result for the 39-word vocabulary
using the two-pass recognizer. A key question that must be answered
is what the two-pass system costs in computation.

To answer this question we must examine the computation in each
pass of the recognizer. In the first recognition pass, for a V-word
vocabulary with Q templates per word, a total of QV DTW comparisons
are made. For a value of N = 40, each DTW comparison requires about
500 nine-point dot-product computations, so a total, $R_1$, of

$R_1 = Q \cdot V \cdot 500 \cdot 9$    (21)

multiplications and additions are required.

If we assume that the local distances $d_j(k)$ associated with the
optimum warping paths are saved for each reference template, then
for each pairwise comparison of the second pass a total of N (typically
40) multiplications and additions are required. For L words in the
equivalence class, a total of

$R_2 = L \cdot (L - 1) \cdot N$    (22)

multiplications and additions are required for the second-pass
computation for a single equivalence class. For the averaging procedure
of eq. (17), $R_2$ is reduced to LN multiplications and additions.

If we assume typical values of V = 39, Q = 12, L = 7, N = 40, we get
$R_1$ = 2,106,000 and $R_2$ = 1680, i.e., the computation of the second
pass is insignificant compared to the first-pass computation.
Furthermore, since we can use reduced values of Q for the first pass
(i.e., Q = 6 or Q = 4), the overall computation can be significantly
reduced from that of the standard isolated word recognizer, with the
same improvement in accuracy!
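
The operation counts of eqs. (21) and (22) can be checked with a few lines of Python. The function and parameter names below are hypothetical; the constants 500 and 9 are the approximate per-comparison figures quoted above for N = 40.

```python
def first_pass_ops(Q: int, V: int, points_per_dtw: int = 500, dot_length: int = 9) -> int:
    """Eq. (21): multiply-adds for Q*V DTW comparisons in the first pass."""
    return Q * V * points_per_dtw * dot_length


def second_pass_ops(L: int, N: int, averaging: bool = False) -> int:
    """Eq. (22): multiply-adds for one equivalence class in the second pass.

    With the averaging procedure the count reduces to L*N.
    """
    return L * N if averaging else L * (L - 1) * N


# Typical values from the text: V = 39, Q = 12, L = 7, N = 40.
print(first_pass_ops(Q=12, V=39))            # 2106000
print(second_pass_ops(L=7, N=40))            # 1680
# Reduced first-pass template counts discussed in Section 3.1.
print(first_pass_ops(Q=6, V=39), first_pass_ops(Q=4, V=39))
```

With these values the second pass adds less than 0.1 percent to the first-pass count, and halving Q halves the first-pass count.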

Fig. 15 — The same results as in Fig. 13 obtained using the minimum method.

Fig. 16 — The same results as in Fig. 14 obtained using the minimum method.

IV. DISCUSSION

The results presented in the preceding section show that improved
recognition accuracy can be obtained via a two-pass recognition
algorithm. It was shown that the improvements were both global, i.e., in
an absolute recognition sense, and local, i.e., within the classes of
equivalent words. Although the proposed two-pass recognizer has a
number of possible implementations, it was shown that the best choices
were to use a reduced set of word templates on the first pass, and to
use all word classes that had reasonably small distance scores on the
second pass.

One of the major issues that remains unresolved in the two-pass 
recognizer is the choice of weighting curve used in the second-pass 
distance computation. The assumed Gaussian model which led to the 
variance-weighted difference of means for the weights is, at best, an 
approximation to the actual situation. Experimentation with modified 
forms of the weighting curve of eq. (13) led to poorer recognition 
performance. Thus, because we lacked a viable alternative, the weight- 
ing curve of eq. (13) is the only one we investigated for use in the two- 
pass recognizer. 

An interesting question that arises as a result of this study is how
this two-pass recognizer could aid in practical recognition tasks. As
one would anticipate, the answer depends on the specific recognition
task. For example, for the backtracking directory listing retrieval
system of Rosenberg and Schmidt, 17 the improvement in recognition
accuracy could provide significant reductions in
search time. However, for the search procedure of Aldefeld et al., 16 the
increased word accuracy would have no effect on the search time, but
could increase the name accuracy, especially when similar names exist
in the directory (e.g., T. Smith and P. Smith). For applications like the
airlines reservation system of Levinson and Rosenberg, 18 the increased
word accuracy would reduce the load on the syntax analyzer; however,
it would not necessarily increase the overall accuracy of the system.

The above examples show that the two-pass recognition strategy 
can be useful for some applications, but one must examine carefully 
the specific task before claiming how useful it will potentially be. 

V. SUMMARY 

We have shown that a two-pass approach to isolated word recogni- 
tion is viable when the word vocabulary consists of sets of acoustically 
similar words. The first recognition pass attempts to determine accu- 
rately the class within which the spoken word occurs, and the second 
recognition pass attempts to order the words within the class, based 
on weighted distances of pairwise comparisons of all words within the 
class. We discussed several alternatives for implementing this two-pass 
recognizer, and we made a performance evaluation which showed that 
a reliable class decision could be made based on a reduced set of 
template scores, and an improved word decision could be made from 
weighted pairwise distance scores. 